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Abstract 

One of the challenges with functional data is incorporating geometric structure, or local 
correlation, into the analysis. This structure is inherent in the output from an increasing number 
of biomedical technologies, and a functional linear model is often used to estimate the relationship 
between the predictor functions and scalar responses. Common approaches to the problem of 
estimating a coefficient function typically involve two stages: regularization and estimation. 
Regularization is usually done via dimension reduction, projecting onto a predefined span of 
basis functions or a reduced set of eigenvectors (principal components). In contrast, we present 
a unified approach that directly incorporates geometric structure into the estimation process 
by exploiting the joint cigenproperties of the predictors and a linear penalty operator. In this 
sense, the components in the regression are 'partially empirical' and the framework is provided 
by the generalized singular value decomposition (GSVD). The form of the penalized estimation 
is not new, but the GSVD clarifies the process and informs the choice of penalty by making 
explicit the joint influence of the penalty and predictors on the bias, variance and performance 
of the estimated coefficient function. Laboratory spectroscopy data and simulations are used to 
illustrate the concepts. 

Keywords: functional data analysis, penalized regression, generalized singular value decomposi- 
tion, regularization. 



1 Introduction 

The coefficient function, /3, in a functional linear model (fLM) represents the linear relationship 
between responses, y, and predictors, x, either of which may appear as a function. We consider the 
special case of scalar-on-function regression, formally written as y = jj x{t)/3{t) dt + e, where x is a 
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random function, square integrable on a closed interval / C IR, and e a vector of random i.i.d. mean- 
zero errors. In many instances, one has an approximate idea about the informative structure of the 
predictors, such as the extent to which they are smooth, oscillatory, peaked, etc. Here we focus on 
analytical framework for incorporating such information into the estimation of /3. 

The analysis of data in this context involves a set of n responses {?/i}F=i corresponding to a 
set of predictor curves each arising as a discretized sampling of an idealized function; i.e., 

Xi = (xiiti), Xiitp)), for some, ti < ■ ■ ■ < tp, of I. In particular, the concept of geometric 
or spatial structure implies an order relation among the index parameter values. We assume the 
predictor functions have been sampled equally and densely enough to capture geometric structure 
of the type typically attributed to functions in (subspaces of) Lp'{I)- For this, it will be assumed 
that p > n although this condition is not necessary for our discussion. 

Several methods for estimating /3 are based on the eigenfunctions associated with the auto- 
covariance operator defined by the predictors \16\ 132] . These eigenfunctions provide an empiri- 
cal basis for representing the estimate and are the basis for the usual ordinary least-squares and 
principal-component estimates in multivariate analysis. The book by Ramsay and Silverman [38] 
summarize a variety of estimation methods that involve some combination of the empirical eigen- 
functions and smoothing, using B-splines or other technique, but none of these methods provide an 
analytically tractable way to incorporate presumed structure directly into the estimation process. 
The approach presented here achieves this by way of a penalty operator, £, defined on the space of 
predictor functions. 

The joint influence of the penalty and predictors on the estimated coefficient function is made 
explicit by way of the generalized singular value decomposition (GSVD) for a matrix pair. Just as 
the ordinary SVD provides the ingredients for an ordinary least squares estimate (in terms of the 
empirical basis), the GSVD provides a natural way to express a penalized least-squares estimate 
in terms of a basis derived from both the penalty and the predictors. We describe this in terms 
of the n X p matrix of sampled predictors, X, and an m x p discretized penalty operator, L. The 
general formulation is familiar as we consider estimates of /3 that arise from a squared-error loss 
with quadratic penalty: 



What distinguishes our presentation from others using this formulation is an emphasis on the joint 
spectral properties of the pair {X,L), as arise from the GSVD. We investigate the analytical role 
played by L in imposing structure on the estimate and focus on how the structure of L's least- 
dominant singular vectors should be commensurate with the informative structure of f3. 

In a Bayesian view, one may think of L as implementing a prior that favors a coefficient function 
lying near a particular subspace; this subspace is determined jointly by X and L. We note, however, 
that informative priors must come from somewhere and while they may come from expectations 
regarding smoothness, other information often exists — including pilot data, scientific knowledge 
or laboratory and instrumental properties. Our presentation aims to elucidate the role of L in 
providing a flexible means of implementing informative priors, regardless of their origin. 

The general concept of incorporating "structural information" into regularized estimation for 
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functional and image data is well established [21 \T2\ I36j . Methods for penalized regression have 
adopted this by constraining high-dimensional problems in various "structured" ways (sometimes 
with use of an norm): locally-constant structure [49^ I46j. spatial smoothness [20], correlation- 
based constraints [52], and network-dependence structure described via a graph [26]. These general 
penalties have been motivated by a variety of heuristics: Huang et al. [23] refer to the second- 
difference penalty as an "intuitive choice"; Hastie et al. [20] refer to a "structured penalty matrix 
[which] imposes smoothness with regard to an underlying space, time or frequency domain"; Tib- 
shirani and Taylor [50] note that the rows of L should "reflect some believed structure or geometry 
in the signal" ; and the penalties of Slawski et al. [l6] aim to capture "a priori association structure 
of the features in more generality than the fused lasso." 

The most common penalty is a (discretized) derivative operator, motivated by the heuristic of 
penalizing roughness (see [HI [38]). Our perspective on this is more analytical: since the eigenfunc- 
tions of the second-derivative operator C = (with zero boundary conditions on [0, 1]) are of the 
form ip{t) = sin(A;7rt), with eigenvalues /c^vr^ {k = 1,2,..), C implements the assumption that the 
coefficient function is well represented by low- frequency trigonometric functions. This is in contrast 
to ridge regression {L = I) which imposes no geometric structure. Although not typically viewed 
this way, the choice oi C = P^, or any differential operator, implies a favored basis for expansion of 
the estimate. 

A purely empirical basis comprised of a few dominant right singular vectors of X is a common and 
theoretically tractable choice. This is the essence of principal component regression (PGR) and these 
vectors also form the basis for a ridge estimate. Although this empirical basis does not technically 
impose local spatial structure (no order relation among the index parameter values is used) , it may 
be justified by arguing that a few principal component vectors capture the "greatest part" of a 
set of predictors |17j . Properties of this approach for signal regression is the focus of [7] and |16j . 
The functional data analysis framework of Ramsay and Silverman [38] provides two formulations of 
PGR. One in which the predictor curves are themselves smoothed prior to construction of principal 
components (chap. 8) and another that incorporates a roughness penalty into the construction 
of principal components (chap. 9), as originally proposed in [55]. In a related presentation on 
signal regression, Marx and Eilers [30] proposed a penalized B-spline approach in which predictors 
are transformed using a basis external to the problem (B-splines) and the estimated coefficient 
function is derived using the transform coefficients. Gombining ideas from [30] and [21], the "smooth 
principal components" method of [8] projects predictors onto the dominant eigenfunctions to obtain 
an estimate then uses B-splines in a procedure that smooths the estimate. Reiss and Ogden [50] 
provide a thorough study on several of these methods and propose modifications that include two 
versions of PGR using B-splines and second-derivative penalties: FPGRc applies the penalty to the 
construction of the principal components (cf. 05]), while FPCR/j incorporates the penalty into the 
regression (cf. [38]). 

In the context of nonparametric regression {X = I) the formulation ([T|) plays a dominant role for 
smoothing [54] • Related to this, Heckman and Ramsay [22] proposed a differential equations model- 
based estimate of a function whose properties are determined by a linear differential operator 
chosen from a parameterized family of differential equations, L/x = 0. In this context, however, the 
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GSVD is irrelevant since X does not appear and the role of L is relatively transparent. 

Algebraic details on the GSVD as it relates to penalized least-squares are given in section [3] with 
analytic expressions for various properties of the estimation process are described in section 13. 2[ 
Intuitively, smaller bias is obtained by an informed choice of L (the goal being small L^). The 
affect of such a choice on the variance is described analytically. Section H] describes several classes of 
structured penalties including two previously-proposed special cases that were justified by numerical 
simulations. The targeted penalties of subsection l4.2l are studied in more detail in section [5] including 
an analysis of the mean squared error for a family of penalized estimates which encompasses the 
ridge, principal-component and James-Stein estimates. 

The assumptions on L here are increasingly restrictive to the point where the estimates are only 
minor extensions of these well-studied estimates. The goal, however, is to analytically describe the 
substantial gains achievable by even mild extensions of these established methods. 

In applications the selection of the tuning parameter, a in ([T]), is important and so Section [6] 
describes our application of REML-based estimation for this. Numerical illustrations are provided 
in section [71 the simulation in subsection 17.11 is motivated by Reiss and Ogden's study of fLMs 
|40j : 17.21 presents a simulation using experimentally-derived Raman spectroscopy curves in which 
the "true" /? has naturally-occurring (laboratory) structure; and section [731 presents an application 
based on experimentally collected spectroscopy curves representing varied biochemical (nanoparti- 
cle) concentrations. An appendix looks at the simulation studied by Hall and Horowitz [16]. We 
begin in section [2] with a brief setup for notation and an introductory example. Note that for any 
L ^ I, the estimated (3 is not given in terms of the ordinary empirical singular vectors (of X), 
but rather in terms of a "partially empirical" basis arising from a simultaneous diagonalization of 
X' X and L' L via the GSVD. Hence, for brevity, we refer to (3a,L as a PEER (partially empirical 
eigenvector for regression) estimate whenever L ^ I. 

2 Background and simple example 

Let /3 represent a linear functional on Lp'{I) defining a linear relationship y = fj x{t)/3{t) dt + e 
(observed with error, e) between a response, y, and random predictor function, x S L'^{I)- We 
assume a set of n scalar responses corresponding to the set of n predictors, {xj}"^]^, each 

discretely sampled at common locations in /. Denote by X the n x p matrix whose ith row is a 
p-dimensional vector, Xj, of discretely sampled functions, and columns that are centered to have 
mean 0. The notation (•, •) will be used to denote the inner product on either L'^{I) or RP, depending 
on the context. 

The empirical covariance operator is i^' = y^X'X, but for functional predictors, typically p > nor 
else K is ill-conditioned or rank deficient. In this case, there are either infinitely many least-squares 
solutions, /3 = argmin^ ||?/ — X/3|p, or else any such solution is highly unstable and of little use. The 
least-squares solution having minimum norm is unique, however, and it can be obtained directly 
by the singular value decomposition (SVD): X = UDV' where the left and right singular vectors, 
Uk and Vk, are the columns of U and V, respectively, and D = [Di 0], where Di = diagjafc}^^]^, 
typically ordered as cti > ct2 > ... > 0"^ > (r = rank(X)). In terms of the SVD of X, the minimum- 
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norm solution is /3+ = X^y = Ylaki=o^^/ ^^^'^'k'y'^k) where denotes the Moore-Penrose inverse of 
X: X'^ = VD'^U' , where = diag{l/(Tfc if 7^ 0; if fj^ = 0}. The orthogonal vectors that form 
the columns of V are the eigenvectors of X' X and are sometimes referred to as a Karhunen-Loeve 
(K-L) basis for the row space of X. 

The solution /3+ is Marquardt's generalized inverse estimator whose properties are discussed in 
|29j . For functional data, /3+ is an unstable, meaningless solution. One obvious fix is to truncate the 
sum to d < r terms so that {cfc}f =x is bounded away from zero. This leads to the truncated singular 
value or principal component regression (PGR) estimate: /Spcr = Ptcn — ^dD^^Ud'y where here, 
and subsequently, we use the notation = col[ai, a^] to denote the first d columns of a matrix 
A. 

When L = I, the minimizer in ([T]) is the ridge penalty due to A. E. Hoerl |23j 

Pa,l = {X'X + aiy'X'y = V ( ) —u'.yvk, (2) 

or, ^a,i = VrF^DlUr'y, where = diagj ^^^^ }. The factor F" acts to counterweight, rather 
than truncate, the terms ^ as they get large. This is one of many possible filter factors which 
address problems of ill-determined rank (for more, see [121 ttH [33]). Weighted (or generalized) 
ridge regression replaces L = I with a diagonal matrix whose entries downweight those terms 
corresponding to the most variation [23]. Other "generalized ridge" estimates replace L = I by a 
discretized second-derivative operator, L = V^. Indeed, the Tikhonov-Phillips form of regularization 
([1]) has a long history in the context of differential equations [511 [36] and image analysis \15\ [33] 
with emphasis on numerical stability. In a linear model context, the smoothing imposed by this 
penalty was mentioned by Hastie and Mallows [21], discussed in Ramsay and Silverman [38] and 
used (on a the space of spline-transform coefficients) by Marx and Eilers [31], among others. The 
following simple example illustrates basic behavior for some of these penalties alongside an idealized 
PEER penalty. 

2.1 A simple example 

We consider a set of n = 50 bumpy predictor curves {xi} discretely sampled at p = 250 locations, 
as displayed in gray in the last panel of Figure [TJ The true coefficient function, /3, is displayed in 
black in this same panel. The responses are defined as yi = {xi,/3) + (e^ normal, uncorrelated 
mean-zero errors), and hence depend on the amplitudes of /3's three bumps centered at locations 
t = 45, 110,210. 

A detailed simulation with complete results are provided in section 17.11 Here we simply illus- 
trate the estimation process for L = /, as in 1^, in comparison with L = and an idealized 
PEER penalty. The latter is constructed using a visual inspection of the predictors and lightly pe- 
nalizes the subspace spanned by such structure, specifically, bumps centered at all visible locations 
(approximately t = 15,45,80,110,160,210,240). 

The first five panels serve to emphasize the role played by the structure of basis vectors that 
comprise the series expansion in ([2|) (in terms of ordinary singular vectors) versus the analogous 
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Figure 1: Partial sums of penalized estimates. The first five odd-numbered partial sums from ([7]) for 
three penalties: 2nd-derivative (dashed), ridge (gray), targeted PEER (black; see text in sections 12.11 and 
17. ip . The last panel displays (3 (black) and 15 predictors, Xi (gray), from the simulation. 



expansion (see ([7])) in terms of generalized singular vectors. In particular, Figure [T] shows several 
partial sums of ([7]) for these three penalties. The ridge process (gray) is, naturally, dominated by 
the right singular vectors of X which become increasingly noisy in successive partial sums. The 
second-derivative penalized estimate (dashed) is dominated by low-frequency structure, while the 
targeted PEER estimate converges quickly to the informative features. 

In this toy example, visual structure (spatial location) is used to define a regularization pro- 
cess that easily outperforms uninformed methods of penalization. Less visual examples where the 
penalty is defined by a set of laboratory-derived structure (in Raman spectroscopy curves) is given 
in sections 17.21 and 17.31 see Figure [2j In that setting, and in general, the role played by L is appro- 
priately viewed in terms of a preferred subspace in W determined by its singular vectors. Algebraic 
details about how structure in the estimation process is determined jointly by X and L ^ I are 
described next. 



3 Penalized least squares and the GSVD 

Of the many methods for estimating a coefficient function discussed in the Introduction, nearly 
are all aimed at imposing geometric or "functional" structure into the process via the use of basis 
functions in some manner. An alternative to choosing a basis outright is to exploit the structure 
imposed by an informed choice of penalty operator. The basis, determined by a pair {X,L), can 
be tailored toward structure of interest by the choice of L. When this is carried out in the least- 
squares setting of ([T|), the algebraic properties of the GSVD explicitly reveal how the structure of 
the estimate is inherited from the spectral properties of {X, L). 

3.1 The GSVD 

For a given linear penalty L and parameter a > 0, the estimate in ([1]) takes the form 

P^^L = {X'X + aL'L)-^X'y. (3) 
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This cannot be expressed using the singular vectors of X alone, but the generalized singular value 
decomposition of the pair [X, L) provides a tractable and interpretable series expansion. The 
GSVD appears in the literature in a variety of forms and notational conventions. Here we provide 
the necessary notation and properties of the GSVD for our purposes (see, e.g., [I9]) but refer to 
[H [131 ES] for a complete discussion and details about its computation. See also the comments of 
Bingham and Larntz [3]. 

Assume X is an n x p matrix [n < p) of rank n, L is an m xp matrix (m < p) of rank m. We also 
assume that n < m < p < m + n, and the rank of the (n + m) x p matrix Z := [X' L']' is p. A unique 
solution is guaranteed if the null spaces of X and L intersect trivially: Null(L) n Null(X) = {0}. 
This is not necessary for implementation, but it is natural in our applications and simplifies the 
notation. In addition, the condition p > n is not required, but rather than present notation for 
multiple cases, this will be assumed. 

Given X and L, the following matrices exist and form the decomposition below: an n x n matrix 
U and an m X m matrix V, each with orthonormal columns, U'U = I, V'V = I; diagonal matrices 
S {n X n) and M {m x m)\ and a nonsingular p x p matrix W such that 



X 
L 



usw~ 

VMW 



S=[Q S] 
M=[M 



S = diag{fTfc} 
M = diag{^fc}. 



(4) 



Si 



p diagonal entries ordered as 



Here, 5 and M are of the form S 
and Ml have q := n + m 

< CTi < 0-2 < • • • < CTg < 1 

1 > Ail > Ai2 > • • • > > 
These matrices satisfy 







and M 



I, 



p~n 







Ml 



, whose submatrices Si 



where 



5 ■•■) 'i; 



(5) 



w'x'xw 




Sf 
0/ 



gs, 



W'L'LW 



10 
Ml 




M'M, 



(6) 



with S^S + MlK = I- 

Denote the columns of C/, V and W by n^, and w^, respectively. In spite of the notation, 
the generalized singular vectors Uk and are not the same as the ordinary singular vectors of X 
in Section [2l They are the same when L = I, although their ordering is reversed; in that case, the 
ordinary singular values correspond to '■= (^k/f^k for fj.^ > 0. By the convention used for ordering 
the GS values and vectors, the last few columns of W span the subspace Null(-L) (or, if Null(-L) is 
empty, they correspond to the smallest GS values, fik)- We set d = dim(Null(L)) and note that 
Uk = for k > n — d. 

Now, equation ([ID and some algebra gives {X'X + aL'L)"^ = W{S!S, + aM'M)"^iy', and so 
f3a,L = W{S^S_ + a M ' M )~^ S'U'y. A consequence of the ordering adopted for the GS values and 
vectors is that the first p — n columns of W play no role in this solution; see equation ([3|). So we 
can replace W by the p x n matrix, Wn, consisting of the last n columns of W (corresponding to 
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the indices p — n + 1 to p). We index the columns of Wn as wi,...,Wn which is consistent with the 
indexing estabhshed in ([5]) for the singular values in S and M. Therefore, the L-penalized estimate 
can be expressed as a series in terms of GS values as 

n—d / ^2 \ " 

This GSV expansion corresponds to a new basis for the estimation process: the estimate is expressed 
in terms of GS vectors {wk} determined jointly by X and L; cf. the ridge estimate in 

For more compact notation used later, define the {n — d)x (n — d) diagonal matrix T = diag{7fc = 
(Tyfc//ifc}^~^. For brevity, set o := n — d and let Aq denote the first o columns of a matrix A. Also, 
denote by A0 the last d columns of A. In particular, the range of W0 is Null(-L). Using this notation, 
d?]) may be written concisely as 

Pa,L = WnoF'^T^Uo'y + W^U^'y, (8) 

r ^2 -> n-d 

where = diag < ■> , ■> > 

In summary, the utility of a penalty L depends on whether the true coefficient function shares 
structural properties with this GSVD basis, {wfc}fc=i- With regard to this, the importance of the 
parameter a may be reduced by a judicious choice of L since the terms in ([7]) corresponding to the 
vectors {wk : //fc = 0} are independent of the parameter a [53] , 

As we'll see, bias enters the estimate to the extent that the vectors {wk ■ 7^ 0} appear in the 
expansion ([7]). The portion of lia,L that extends beyond the subspace Null(L) is constrained by a 
sphere (of radius determined by a); this portion corresponds to bias. Hence, L may be chosen in 
such a way that the bias and variance of Pa^i arises from a specific type of structure, potentially 
decreasing bias without increasing complexity of the model. As a common example, L = T)^ 
introduces smooth bias structured by low- frequency trigonometric functions. 

3.2 Bias and variance and the choice of penalty operator 

Begin by observing that the penalized estimate Pa^L in dS]) is a linear transformation of any solution 
to the normal equations. Indeed, define = X"^ = {X' X+aL'L)~^X' and note that if /3 denotes 
any solution to X' Xf3 = X'y, then (3a, l = X'^X/^ + X'^e. The resolution operator X'^X reflects the 
extent to which the estimate in ([7]) is linearly transformed relative to an exact solution. In particular, 
E0a,L) = X*Xfi. Additionally, we have bias(/3a,L) = {I-X#X)/3 = a{X' X + aV L)-^L' L(3, and 
so ||bias(/3Q,^L)|| < ||a(X'X + aL'L)^^L'\\ \\Lf3\\. Hence bias can be controlled by the choice of L, 
with an estimate being unbiased whenever Lf3 = 0. There is a tradeoff, of course, and equation (jlOp 
below quantifies the effect on the variance as determined by W0 (i.e., {wk}^^n-d+\) Null(L) is 
chosen to be too large. 

More generally, the decompositions in ^ lead to an expression for the resolution matrix as 
X*X = W{S^S + aMlM)-^S^SW-^, and so / - X*X = aW{S!S + aMlM)-^MlMW"^. Again, 
from equation @, the first p — n rows of are not used. For notational convenience, define 

W := W'~^ (note, W plays a role analogous toV = V'^^ in the SVD). As before, let Wn denote the 
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p X n matrix consisting of the last n columns of W and note that in equation (H]), X = US_W ^ = 
USiWn)'. Hence, I-X*X = aWniS"^ + aM"^)"^ M"^ {Wn)' , and so the bias of can be expressed 
as 

n—d 2 

bias(/3,.L) = (/ - X*X)(3 = V ^ WkWk'f^ (9) 

where wj. is the fcth column of W . In particular, the bias vector Q is contained in span{w;fc : //j, 7^ 0}, 
whereas the estimate j5a,L is in spanjiUfc : 0"^ 7^ 0}. In the special case that X' X is invertible, then 

/3 = WS-^U'y and bias(^«,L) = Efc (^^) ^^^^^^ (see [SH]). 

A counterpart is an expression for the variance in terms of the GSVD. Let S denote the covari- 
ance for e. Then var(/3a^/,) = var(X*X/? + X*e) = Assuming S = Ue/, this simplifies 

to 

(n—d 2 \ 

fc=l ^^fe + "^fc^ fc=n-d+l / 

An interesting perspective of the bias-variance tradeoff is provided by the relationship between 
the GS-values in ([5]) and their role in equations ([9]) and (jlOp . Moreover, these lead to an explicit 
expression for the mean squared error (MSE) of a PEER estimate. Since E{^a,L) = X'^X/S, 

MSE(^«,i) = Ei\\(3 - ^^^Lf) = Em\^ + ||/3«,i.||2 - 2(/3,/3„,i)) 
= \\P-X*XP\\^ + a^^tTace{X*X*') 

n—d 2 ^^^^ 

= ||bias(/3^,L)|P+^e E ^ 2^'' 2N2 II^^-Il^- 

The GS-vectors {wk} are not necessarily orthogonal, although they are X'X-orthogonal; see ([6]). 
Consequently, a bound, rather than equality, for the MSE in terms of the GS values/vectors is the 
best one can do in general: 

/n-d 2 \ ^ 

MSE(/3c,,l) < ^ ^"^^ ^ Wk'PWwkW + (the second term in ([H])). (12) 



As a final remark, recall that one perspective on ridge estimation defines fictitious data from 
an orthogonal "experiment," represented by an L, and expresses / as / = L'L [29j . Regardless of 
orthogonality this applies to any penalized estimate and L may similarly be viewed as augmenting 
the data, influencing the estimation process through its eigenstructure; the response, y, is set to 
zero for these supplementary "data". In this view, equation ([3]) can be written as Zf3 = y where 



Z 



X 



and y 



V 




. This formulation proves useful in section 15.31 when assuring that the 



estimation process is stable with respect to perturbations in X and the choice of L. 

4 Structured penalties 

A structured penalty refers to a second term in ([1]) that involves an operator chosen to encourage 
certain functional properties in the estimate. A prototypical example is a derivative operator which 
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imposes smoothness via its eigenstructure. Here we describe several examples of structured penal- 
ties, including two that were motivated heuristically and implemented without regard to the spectral 
properties that define their performance. Sections 13.21 and 15.31 provide a complete formulation of 
their properties as revealed by the GSVD. 

4.1 The penalty of C. Goutis 

The concept of using a penalty operator whose eigenstructure is targeted toward specific properties 
in the predictors appears implicitly in the work of C. Goutis [H]. This method aimed to account 
for the "functional nature of the predictors" without oversmoothing and, in essence, considered the 
inverse of a smoothing penalty. Specifically, if A denotes a discretized second-derivative operator 
(with some specified boundary conditions), the minimization in ([1]) was replaced by min^{||y — 
XA'A/3||^„ +a||A/3||^2}- Here, the term XA'A/5 can be viewed as the product of X/\' (derivatives 
of the predictor curves) and A/3 (derivative of /3). Defining 7 := A' A/3 and seeking a penalized 
estimate of 7 leads to 

7 = (X'X + a(A'A)t)-ix'?/ 

= argmin{||y - X^\\^ + a{j, {A'A)^)}. 
7 

In |14j . the properties of 7 were conjectured to result from the eigenproperties of (A'A)^. This was 
explored by ignoring X and plotting some eigenvectors of (A'A)^. The properties of this method 
become transparent, however, when formulated in terms of the GSVD. That is, let L := ((A'A)^)^/^ 
and note the functions that define 7 are infiuenced most by the highly oscillatory eigenvectors of L 
which correspond to its smallest eigenvalues; see equations ([5]) and ([7]). 

This approach was applied in [13] only for prediction and has drawbacks in producing an inter- 
pretable estimate, especially for non-smooth predictor curves. The general insight is valid, however, 
and modifications of this penalty can be used to produce more stable results. The operator (A'A)^ 
essentially reverses the frequency properties of the eigenvectors of A and is an extreme alternative 
to this smoothing penalty. An eigenanalysis of the pair {X,L), however, suggests penalties that 
may be more suited to the problem. This is illustrated in Section [71 

4.2 Targeted penalties 

Given some knowledge about the relevant structure, a penalty can be defined in terms of a subspace 
containing this structure. For example, suppose /3 G Q := spanjqj}^^^ in L^(/). Set Q = Yl'j=i 
qj and consider the orthogonal projection Pq = QQ^ . (Here, q q denotes the rank-one operator 
/ ^ {f,Q)Q, for / G L^(/).) For L = / - Pq, then /3 G Nun(L) and ^a,L is unbiased. The problem 
may still be underdetermined so, more pragmatically, define a decomposition-based penalty 

L^LQ = a{I- Pq) + bPQ (14) 

for some a,b > 0. Heuristically, when a > 6 > the effect is to move the estimate towards Q 
by preferentially penalizing components orthogonal to Q; i.e., assign a prior favoring structure 
contained in the subspace Q. To implement the tradeoff between the two subspaces, we view 



10 



a and b as inversely related, ab = const. The analytical properties of estimates that arise from 
this are developed in the next section and illustrated numerically in Section [71 For example, bias 
is substantially reduced when /3 C Q, and equation (fT9l) quantifies the tradeoff with respect to 
variance when the prior Q is chosen poorly. 

More generally, one may penalize each subspace differently by defining L = ai{I — Pq)Ci{I — 
Pq) + q;2-Pq^2-Pq5 for some operators £i and £2- This idea could be carried further: for any 
orthogonal decomposition of by subspaces Qi, ■ ■ ■ , Qj, let Pj be the projection onto Qj. 

Then the multi-space penalty L = J2j=i ^j-^j leads to the estimate 

J 

/3 = argmin{||y - X/3||2 + V aj||P,-/3|P}. 

This concept was applied in the context of image recovery (where X represents a linear distortion 
model for a degraded image y) by Beige et al. [1]. 

The examples here illustrate ways in which assumptions about the structure of a coefficient 
function can be incorporated directly into the estimation process. In general, any estimation of (3 
imposes assumptions about its structure (either implicitly or explicitly) and section [3^2] shows that 
the bias-variance tradeoff involves a choice on the type of bias (spatial structure) as well as the 
extent of bias (regularization parameter(s)). 

5 Some analytical properties 

Any direct comparison between estimates using different penalty operators is confounded by the 
fact there is no simple connection between the generalized singular values/vectors and the ordi- 
nary singular values/ vectors. Therefore, we first consider the case of targeted or projection-based 
penalties (jl4p . Within this class, we introduce a parameterized family of estimates that are com- 
prised of ordinary singular values/vectors. Since the ridge and PGR estimates are contained in (or 
a limit of) this family, a comparison with some targeted PEER estimates is possible. For more 
general penalized estimates, properties of perturbations provide some less precise relationships; see 
proposition 15.61 

5.1 Transformation to standard form 

We have reason to consider decomposition-based penalties (114p in which L is invertible. In this 
case, an expression for the estimate does not involve the second term in ([8]), and decomposing the 
first term into two parts will be useful. For this, we find it convenient to use the standard-form 
transformation due to Elden [11] in which the penalty L is absorbed into X. This transformation 
also provides a computational alternative to the GSVD which, for projection-based penalties in 
particular, can be less computationally expensive; see, e.g., |25j . By this transformation of X, a 
general PEER estimate (L 7^ /) can be expressed via a ridge-regression process. 

Define the X -weighted generalized inverse of L and the corresponding transformed X as: 

:= (/ - [X{I - L^L)]^X)L^ and X := X L^; 
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see imilQj. In terms of the GSVD components (jH), the transformed X is X = UW . In particular, 
the diagonal elements of F = SM^ are the ordinary singular values of X, but in reversed order. 
Moreover, a PEER estimate can be obtained from a ridge-like penalization process with respect to 
X. That is, for 



arg mmjlly 



+ a||J3||2} where j=[X{I-L^L)] 



y, 



(15) 



then 



Note that the transformed estimate as given in terms of the GSVD factors is: Pa = VFT^U'y, 
where F = diag{7^/(7^ + a)}. 

In what follows we consider invertible L in which case L^^ = L'^ and [X{I - LtL)]t = 0. In 
particular, /3a^L = L~^^a- For the penalty of the form L = a{I — Pq) + bPQ, then = 
^(/ — Pq) + ^Pq, and so X = \X{I — Pq) + ^XPq. The regularization parameter, previously 
denoted by a, can be absorbed into the values a and b, so we will denote this PEER estimate /3a,L 
simply as ^a,b- 



Remark 5.1. When a = b = ^/a, this is simply a ridge estimate: pa,b = /5a,/- Therefore, the 
best performance among this family of estimates is as least as good as the performance of ridge, 
regardless of the choice of Q. 



5.2 SVD targeted penalties 

Consider the special case in which Q is the span of the d largest right singular vectors of an n x p 
matrix X of rank n. Let X = [/ [ D ] y be an ordinary singular value decomposition where 
D is a diagonal matrix of singular values. For consistency with the GSVD notation, these will be 
ordered as < cJi < • • • < (T„. As before, the first p — n columns of V are not used. Rather than 
introduce extra notation, we write X = UDV', letting V denote the n xp whose columns correspond 
to the singular vectors in D. So now, the last d columns of V correspond to the d largest singular 
values of X (i.e., Q = V0). 

We are interested interested in the penalty L = a{I — Pq) + 6Pq, where d = dim(Null(/ — Pq)). 
Similar to before, set o = n — d and define 0x0 and d x d submatrices, Do and D0, of D as 



D 



Do 
D. 



also set A 



alo 
bid 



(16) 



Here, Pq = V^Vj and (/ - Pq) = VoVo and so, 

X = -UDV'iVoVo') + \UDV\V0VJ) = -UD 
aba 



Vo' 





+ tUD 






VJ 



u 



iDoVo' 




+ u 





ID0V0' 



U{Dhr^)V' 



This decomposition implies that the ridge estimate in (|15p is of the following form: setting G 
DA~^, denoting its diagonal entries by {7/;}, and defining F = diag{7^/(7^ + 1)} gives [p 
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VFG^U'y. Now, 

L- V = -VoVo'V + \v,V,'V =\^°] \ \° 1° 1 = FA-^ 
a h \_ \ \_ ^ b^d 

and so /3„,6 = L-^p = L~^{y FG'^U'y) = VA^^FAD~Wy. By the decomposition 

^a,b = VoFoD-^Uo'y + V^F^D^^U^y. 
This shows that the estimate decomposes as fohows. 

Theorem 5.2. Let Q be the span of the largest d right singular vectors of X. Set L = a{I — Pq) + 
5Pq. Then, in terms of the notation above, the estimate /3a^b decomposes as 

n—d y ^2 \ " / CT'^ \ 1 

k=l ^ k=d+l 

where the left and right terms are independent of b and a, respectively. 

Similar arguments can be used to decompose an estimate for arbitrary Q: 



n—d / ^2 \ 2 " / (T^ \ 1 

Kh = E ( 2 i^^^ 2 ) - ^kVWk + V ( 2 ^^^2 2 ) - ^kV^k- (18) 



In this case, however, all terms are dependent on both o and b. Indeed, using notation as in ([8]) 
one can decompose X = \UYoV' + \UY^V' and obtain p = VFV'^U'y. However, L^V = WM\ 
and the non-orthogonal terms provided by W do not decompose the estimate into terms from the 
orthogonal sum Q © Q-*- . 

The following corollary, along with Remark 15. H records the manner in which (jl7p is a family 
of penalized estimates, parameterized by a, 6 > and d G {l,...,n}, that extends some standard 
estimates. 

Corollary 5.3. Under the conditions in Theorem \5. SI 

1. when a > b > 0, Pa,b is a sum of weighted ridge estimates on Q and Q"*"; 

2. when a > and 6 = 0, /3a^o is given by ([8]), which is a sum of PCR and ridge estimates on Q 
and Q-*", respectively; 

3. for each d, the PCR estimate /Sp^^ is the limit of Pafi as a —t- oo. 

In item 2, this estimate is similar to PCR except that a ridge penalty is placed on the least- 
dominant singular vectors. Under the assumptions here, Wk = Vk are the ordinary singular vectors 
of X and the ordinary singular values appear as 7^ = Ck/f^ki for fik > 0. In the second term of 
([8]), the singular vectors are in the null space of L (since 6 = 0), and so /i^ = and ak = 1, for 
k = n — d + 1, ■■■,p. Regarding item 3, although a PCR estimate is not obtained from equation ([3|) 
for any L, it is a limit of such estimates. 

Other decompositions may be obtained simply by using a permutation, such as Q = HV, for 
some n X n permutation matrix 11. Stein's estimate, f3a,s, also fits into this framework as follows. 
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When X'X is nonsingular, then = {X' X+aX' Xy^X'y (see, e.g., the class 'STEIN' in [TO]), 
and X' X = VD'DV' . Hence this estimate arises from the penalty Lg = DV' . This is a re- weighted 
version of L = a{I — Pq) where d = n, Q = V and the parameter a is replaced by the matrix D. The 
result is a constant filter factor F = diag{l/(l + Q;)}. Using d <n and Q = is a natural extension 
of this idea. More generally, Q may be enriched with functions that span a wider range of structure 
potentially relevant to the estimate. This concept is illustrated in Section [7l3] where instead of V^, 
we use a d-dimensional set of experimentally-derived "template" spectra supplemented with their 
derivatives to define Q. 

As an aside, we note that in a different approach to regularization one can define a general 
family of estimates arising from the SVD by way of ji^^^p = VTif^U'y, where S/^ = diag{^(^(^)}, and 
99 : [R+ — > [R is an arbitrary continuous function [33]. A ridge estimate is obtained for ip(t) = 
and PGR obtained for {p{t) = 1/t if t > 1, ip{t) = if t < 1 (an L^-limit of continuous functions). 
This is similar to item 3 in Corollary 15.31 but the family of estimates l3h,ip is formulated in terms of 
functional filter factors rather than explicit penalty operators. Related to this is the fact that the 
optimal (with respect to MSE) estimate using SVD filter factors is, in the case C = a^I, expressed 



as /3oH = VFDWy, where F = diag{al/{al + cr^W)-^)}; see the "ideal filter" of 03. In fact. 



it's easy to check that this optimal estimate can be obtained as /3oh = /3a, L for some L ^ F Since 
/3oH involves knowledge of /3, it is not directly obtainable but it points to the optimality of a PEER 
estimate. 

5.3 The MSE of some penalized estimates. 

Theorem 15.21 is used here to show that /3a, fe can have smaller MSE than the ridge or PGR estimates 
for a wide range of values of a and/or b. The MSE is potentially decreased further when L is defined 
by a more general Q. In that case, a general statement is difficult to formulate but Proposition 15.61 
confirms that any improvement in MSE is robust to perturbations in L (e.g., general Q) and errors 
in X. 

An immediate consequence of Theorem 15.21 is that the mean squared error for an estimate in 
this family ()17p decomposes into easily-identifiable terms for the bias and variance: 



to d, the cTfc's in the third term decrease and the contribution to the variance from this term 
increases — the estimate fails for the same reason that ordinary least-squares fails. Any nonzero b 
stabilizes the estimate in the same way that a nonzero a stabilizes a standard ridge estimate; the 
decomposition ([T7|) merely re-focuses the penalty. This is illustrated in Section [7] (Table [1]) and 
in the Appendix (Tabled]). Although there are three parameters to consider, the MSE of f3afi is 
relatively insensitive to 5 > for sufficiently large d. This could be optimized (similar to efforts to 




(19) 



The influence of 6 = on the estimate is now clear: when the numerical rank of X is small relative 
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optimize the number of principal components) but here we assume some knowledge regarding Q, 
hence d. Relationships between ridge, PGR and PEER estimates in this family {Pa,b}a,b>o can be 
quantified more specifically as follows. 

Proposition 5.4. Suppose (3 £ Q and fix a > 0. Then for any a > ^/a, the ridge estimate satisfies 



Proof. This follows from the fact that Vo'P = and so the second term in (jl9p is zero. Therefore, 
the contribution to the MSE by the first term is decreased whenever a > y/a. □ 

If /3 is exactly a sum of the d dominant right singular vectors, A PGR estimate using d terms 
may perform well, but is not optimal: 

Proposition 5.5. If f3 £ Q, a sufficient condition for the PCR estimate to satisfy 



MSE0i^^) = MSE0^fl) > MSE0^^, 



b) 



IS 

1 2d 



E + i >\\y^'^\\' (20) 



Kk=n—d+l " 



Note that the left side of (|20]l increases without bound as Uk — )• 0. Since || V0'/3|p = Sfc=„_rf+i(ffc/3)^, 
and since the premise of PGR is that decreases with decreasing a^, this sufficient condition is 
entirely plausible. 



Proof. The MSE of Ppa^ consists of the second and third terms of (jl9p : 



^PCR 

n—d 

fc=l k=n-d+l « 



n—d n ^ 



In particular, a sufficient condition for this to exceed MSE(/3oo,b) is for the variance term to exceed 
the third and fourth terms of ()19p : 



E 4>-? E 7^.^ E 



2 

k=n-d+l k k=n-d+l ^ k ' > k=n-d+l ^ k 



Its easy to check that this is satisfied when (|2Up holds. □ 

A comment by Bingham and Larntz [3] on the intensive simulation study of ridge regression in 
|10] notes that "it is not at all clear that ridge methods offer a clear-cut improvement over [ordinary] 
least squares except for particular orientations of /3 relative to the eigenvectors of X'X." Equation 
(|19p repeats this observation relating these two classical methods as well as the minor extensions 
contained in (jl7p . If, on the other hand, "the orientation of /3 relative to the [ffc's]" is not favorable, 
i.e., if /3 is nowhere near the range of V, then a PEER estimate as in (jlSp is more desirable than 
(fT7|) (assuming sufficient prior knowledge). 
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In summary, the family of estimates {/3a,b}a,b>o ™ ()17p represents a hybrid of ridge and PGR 
estimation. This family — based on the ordinary singular vectors of X — is introduced here to provide 
a framework within which these two familiar estimates can be compared to (slightly) more general 
PEER estimates. Direct analytical comparison between general PEER estimates is more difficult 
since there's no simple relationship between the generalized singular vectors for two different L 
(including L = I versus L ^ I). However, it is important that the estimation process be stable 
with respect to changes in L and/or X. I.e., in going from an estimate in (jl7p to one in (jlSp . the 
performance of the estimate should be predictably altered. Given an estimate in Proposition 15. 4| if 
Q is modified and/or X is observed with error, the MSE of the corresponding estimate, i, should 
be controlled: for sufficiently small perturbation E, the corresponding estimate MSE(/3^j^) should 
be close to MSE(/3q,^/). This "stability" is true in general. To see this, recall Z = ^ X' ^/aL' ] 
(of rank p) and y = \y' ]'. Then another way to represent the estimate ^ is /3q,^l = Z'^y. Let 
E = [ E\ E2 ]' for some nx p and m x p matrices Ei and E2. Set Ze = Z + E and denote the 
perturbed estimate by = z\,y. By continuity of the generalized inverse (e.g., [1], Section 1.4), 
lim||£;||_j.o = Z"^ if and only if lim||£;||_i.o rank(Z£;) = rank(Z). Therefore, provided the rank of Z 
is not changed by E, 

lim ||/3„,L-/3f;i|| < lim ||Zt - 4|| ||y|| = 0, 



and hence MSE(/?^^) ^ MSE(/5^,l) as ||^|| ^ 0. A more specific bound on the difference of 



estimates can be obtained under the condition < 1 which implies that ||.^^|| < 

This can be used to obtain the following bound. 

Proposition 5.6. Assume ||Zt||||S|| < 1 andletr = y- ZPa,L- Then 

U~R U<J1^!}ML(UR II , ||7t| 

\\Pa,L - Pa,L\\ < 1 _ 1 1 1 1 1 1 | v'^"'^" " ' 

See [3] and 



6 Tuning parameter selection 

Despite our focus on the GSVD, the computation of a PEER estimate in ([1]) does not, of course, 
require that this decomposition be computed. Rather, the role of the GSVD has been to provide 
analytical insight into the role a penalty operator plays in the estimation process. For computation, 
on the other hand, we have chosen to use a method in which the tuning parameter, a, is estimated 
as part of the coefficient-function estimation process. 

Because the choice of tuning parameter is so important, many selection criteria have been 
proposed, including generalized cross-validation (GGV) [9], AIC and its finite sample corrections 
|55j . As an alternative to GCV and AIC, a recently-proven equivalence between the penalized 
least squares estimation and a linear mixed model (LMM) representation [6] can be used. In 
particular, the best linear unbiased predictor (BLUP) of the response y is composed of the best 
linear unbiased estimator of the fixed effects and BLUP of the random effects for the given values 
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of the random component variances (see [U] and [6]). Within the LMM framework, restricted 
maximum hkehhood (REML) can be used to estimate the variance components and thus the choice 
of the tuning parameter, a, which is equal to the ratio of the error variance and the random effects 
variance [l2] . REML-based estimation of the tuning parameter has been shown to perform at least 
as well as the other criteria and, under certain conditions, it seen to be less variable than GCV-based 
estimation [41]. In our case, the penalized least-squares criterion ([T]) is equivalent to 

Pa,L = argmin{||y - XunpPunp - ^pen/3pen|P + a||L/3||^} (21) 

where /3 = [/3^„p P'pen]' ^ the Xunp corresponds to the unpenalized part of the design matrix, and 
Xpen to the penalized part. 

For simplicity of presentation, we describe the transformation with an invertible L. However, a 
generalized inverse can be used in case L is not of full rank; see equation (jlSp . Also, to facilitate 
a straightforward use of existing linear mixed model routines in widely available software packages 
(e.g., R |37j or SAS software 03]), we transform the coefficient vector f3 using the inverse of the 
matrix L. Let X* = XL and /3* = L~^j3. Then equation (|2ip can be modified as follows 

= argmin{||y - X'^^f + 

This REML-based estimation of tuning parameters is used in the application of Section 17.31 

For estimation of the parameters a, h and a involved in the decomposition-based penalty of 
equation (jl7p . we view a and h as weights in a tradeoff between the subspaces and assume ab = 
const. For implementation, we fix one, estimate the other using a grid search, and use REML to 
estimate a. 

7 Numerical examples 

To illustrate algebraic properties given in Section [5l we consider PEER estimation alongside some 
familiar methods in several numerical examples. Section 17.11 elaborates on the simple example in 
Section [2.11 These mass spectrometry-like predictors are mathematically synthesized in a manner 
similar to the study of Reiss and Ogden [JO] (see also a numerical study in [S]). Here, /3 is also 
synthesized to represent a spectrum, or specific set of bumps. In contrast. Section 17.31 presents 
a real application to Raman spectroscopy data in which a set of spectra {xi\ and nanoparticle 
concentrations {yi} are obtained from sets of laboratory mixtures. This laboratory-based application 
is preceded in section 17.21 by a simulation that uses these same Raman spectra. In both Raman 
examples, targeted penalties are defined using discretized functions qj chosen to span specific 
subspaces, Q = span{(7j}^^^. As before, let Q = col[gi, qd] and Pq = QQ'^ . 

Each section displays the results from several methods, including derivative-based penalties. 
Implementing these requires a choice of discretization scheme and boundary conditions which define 
the operator. We use where T) = [djj] is a square matrix with entries di^i = 1, dj^j+i = — 1 and 
(ijj = otherwise. In addition to some standard estimates, sections 17.31 and 17.21 also consider 
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FPCR/j, a functional PGR estimate described in [JO]. This approach extends the penahzed B- 
sphne estimates of [8] and assumes (3 = Bij where B is an px K matrix whose columns consist of K 
B-spline functions and rj is a vector of B-spline coefficients. The estimation process takes place in 
the coefficient space using the penalty L = T)^ applied to rj. The FPCR/j estimate further assumes 
/3 = BVdT] {Vd as defined in section [2]). 

Estimation error is defined as mean squared error (MSE) ||/3 — /3a,L|P, and the prediction error 
defined similarly as Y^- \yi — yi\'^, where yi = {xi, /3). Each simulation incorporates response random 
errors, ~ N{0,a'^), added to the ith true response, yf"° = {xi,/3). Letting Sy denote the 
sample variance in the set {yf^^jf^i, the response random errors, e^, are chosen such that := 
Sy/iSy + o"^] (the squared multiple correlation coefficient of the true model) takes values 0.6 and 
0.8. In sections 17.11 and 17.21 tuning parameters are chosen by a grid search. In section [731 tuning 
parameters are chosen using REML, as described in section [6l 

7.1 Bumps Simulation 

Here we elaborate on the simple example of section 12.11 This simulation involves bumpy predictor 
curves Xi{t) with a response yi that depends on the amplitudes Xi{t) at some of the bump locations, 
t = Cfc, via the regression function f3. In particular, 

= X] 6xp[-^i(* - Cj)] + ei{t), (3{t) = ^ aj exp[-bj(t - cj)], t G [0, 1] 

where Jx = {2, 6, 10, 14, 20, 26, 30}, Jjs = {6, 14, 26}, are the magnitudes, 5^ are the spreads, and 
are the locations of the bumps. In the first simulation, we set bj = 10000 and Cj = 0.004(8j — 1), 
the same for each curve Xj. This mimics, for instance, curves seen in mass spectrometry data. 
The assumption Jg C Jx simulates a setting in which the response is associated with a subset of 
metabolite or protein features in a collection of spectra. The ajj's are from a uniform distribution, 
and Oj = 3,5,2 for j = 6,14,24, respectively. We consider discretized curves, Xi{t), evaluated at 
p = 250 points, tj, j = 1, . . . ,p. The sample size is fixed at n = 50 in each case. 
Penalties. We consider a variety of estimation procedures: ridge (L = I), second-derivative 
(P^), a more general derivative operator (D^ + 0/) and PGR. We also define two decomposition- 
based penalties (fT^ formed by specific subspaces Q = span{qj}j^j for qj of the form qj{t) = 
aj exp[bj{t — Cj)], with cj at all locations seen in the predictors, Jy = {2, 6, 10, 14, 20, 26, 30}, or at 
uniformly-spaced locations, Ju = {2,4, . . . , 30}; denote these penalties by Ly and Lu, respectively. 
Simulation results. The simulation incorporates two sources of noise: (i) response random errors, 
~ A^(0, cr^), added to the ith true response so that = 0.6,0.8; (ii) measurement error, Cj ~ 
Np{0,a^I), added to the ith predictor, Xj. To define a signal-to-noise ratio, S/N, set Sf := \\xi — 
fiiW^ / {p — 1) , where fii is the mean value of Xi, and set 5^ := ^/n'^-Sf. The are chosen so that 
S/N := Sx/(Te = 2,5,10. 

Figure [1] shows a few partial sums of ^ for estimates arising from three penalties: P^, L = I 
and Ly, when R^ = 0.8 and S/N = 2. Tabled] gives a summary of estimation errors. The penalty 
Ly, exploiting known structure, performs well in terms of estimation error. Not surprisingly, a 
penalty that encourages low-frequency singular vectors, P^, is a poor choice although + al 
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easily improves on since the GSVs are more compatible with the relevant structure. PGR 
performs well with estimation errors that can be several times smaller than those of ridge. The 
number of terms used in PGR ranges here from 8 {S/N = 10) to 25 {S/N = 2). 

Table 1: Estimation errors (MSE) for simulation with selected bump locations. Sample size is n = 50. 





S/N 


Lv 


Lu 


PGR 


ridge 




V'^ + al 


0.8 


10 


4.00 


13.81 


9.38 


34.39 


359.83 


76.31 


0.8 


5 


3.72 


15.46 


21.50 


40.02 


246.17 


72.64 


0.8 


2 


4.40 


12.96 


57.89 


58.22 


126.75 


59.35 


0.6 


10 


9.60 


21.60 


14.10 


50.50 


497.70 


113.60 


0.6 


5 


10.22 


21.65 


26.02 


50.68 


338.70 


87.58 


0.6 


2 


11.75 


23.18 


63.50 


67.94 


181.75 


78.45 



Predictably, PGR performance degrades with decreasing S/N, a property that is less pro- 
nounced, or not shared, by other estimates. Performances of Ly and Ljj illustrate properties 
described in Section [5.31 As S/N — > 0, the ordinary singular vectors of X (on which ridge and PGR 
rely) decreasingly represent the structure in /3. The GS vectors of {X,Lv) and {X,Lij), however, 
retain structure relevant for representing /3. 

Table 2: Prediction errors for simulation with selected bump locations. Sample size is n = 50. Errors are 
multiplied by 1000 for display. 





S/N 


Lv 


Lu 


PGR 


ridge 


V' 


V'^ + al 


0.8 


10 


9.0 


10.5 


10.8 


16.6 


19.3 


12.9 


0.8 


5 


8.4 


11.0 


12.2 


26.7 


27.9 


17.8 


0.8 


2 


12.9 


19.0 


53.2 


55.7 


50.3 


40.1 


0.6 


10 


21.4 


23.0 


23.9 


33.0 


39.0 


26.2 


0.6 


5 


23.9 


25.0 


29.5 


49.2 


54.6 


34.4 


0.6 


2 


34.4 


42.5 


90.4 


110.4 


104.4 


77.9 



Table [2] summarizes prediction errors. When S/N is large, performance of PGR is comparable 
with Ly and Lu, but degrades for low S/N. Here, even + al provides smaller prediction errors, 
in most cases, than ridge, or PGR. This illustrates the GS vectors role in (jl3p and reiterates 
observations in [H]. 

7.2 Raman simulation 

We consider Raman spectroscopy curves which represent a vibrational response of laser-excited co- 
organic/inorganic nanoparticles (GOINs). Each GOIN has a unique signature spectrum and serves 
as a sensitive nanotag for immunoassays; see [271 144j . Each spectrum consists of absorbance values 
measured at p = 600 wavenumbers. By the Beer-Lambert law, light absorbance increases linearly 
with a GOIN's concentration and so a spectrum from a mixture of GOINs is reasonably modeled 
by a linear combination of pure GOIN spectra. The data here come from experiments that were 
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400 600 800 1000 1200 1400 1600 1800 

Figure 2: Nine pure COIN spectra, Pi,...,Pg, and a coefScient function, /3 (each shifted for display). /3 
arises as a solution to the fLM in which y denotes concentrations of Pg in an in silico mixture of 50 COIN 
spectra, Xi (light gray). This /? is used in the simulation study of Section 17.21 

designed to establish the ability of these COINs to measure the existence and abundance of antigens 
in single-cell assays. 

Let Pi, Pio denote spectra from nine pure COINs and one "blank" (no biochemical material), 
each normalized to norm one. We form in-silico mixtures as follows: Xi = X^^^i Q.^-P/o i = 
with coefficients {ci^k} generated from a uniform distribution. Figure [2] shows representative spectra 
from all nine COINs superimposed on a collection of mixture spectra, {xi}f^^. Included in Figure [2] 
is the /3 (dashed curve) used to defined the simulation: y,/ = {xi,(3) + e, e ~ N{0,a'^). 

In this simulation, we have created a coefficient function which, instead of being modeled math- 
ematically, is a curve that exhibits structure of the type found in Raman spectra. Details on the 
construction of this /3 are in Appendix 19. II so here we simply note that it arises as a ridge estimate 
from a set of in-silico mixtures of Raman spectra in which one COIN, Pg, is varied prominently 
relative to the others. See Figure [2j Motivation for defining (3 in this way is based on a view that 
it seems implausible for us to predict the structure of realistic signal in these data and recreate it 
using polynomials, Gaussians or other analytic functions. 

Regardless of its construction, (3 defines signal that allows us to compute estimation and pre- 
diction error. The performances of five methods are summarized in Table [3l Note that although /3 
was constructed as a ridge estimate (using a different set of in-silico mixtures; see Appendix 19. ip . 
the ridge penalty is not necessarily optimal for recovering /3. This is because the strictly empirical 
eigenvectors associated with the new spectra may contain structure not informative regarding y. 
Also, in these data, the performance of FPCR/j is adversely affected by a tendency for the estimate 
to be smooth; cf.. Figure [3l The PEER penalty used here is defined by a decomposition-based 
operator ()17p in which Q is spanned by a 10-dimensional set of pure-COIN spectra (including a 
blank). The success of such an estimate obviously depends on an informed formation of Q, but 
as long as the parameter-selection procedure allows for a = b, then the set of possible estimates 
includes ridge as well as estimates with potentially lower MSE than ridge; see Proposition 15.41 

We note that this simulation may be viewed as inherently unfair since the PEER estimate uses 
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Table 3: Estimation (MSE) and prediction (PE) errors of several penalization methods for the simulation 
described in Figure [2j Numbers represent the average error from 100 runs. PE errors are multiplied by 1000 
for display. 





Lq 


PGR 


ridge 


p2 + a/ 


FPCRr 


MSE 
PE 


8.91 
0.0071 


12.34 
0.0179 


13.87 
0.0139 


41.69 
0.0131 


15.29 
0.0175 



knowledge about the relevant structure. However, this is a point worth reemphasizing: when prior 
knowledge about the structure of the data is available, it can be incorporated naturally into the 
regression problem. 

7.3 Raman Application 

We now consider spectra representing true antibody-conjugated COINs from nine laboratory mix- 
tures. These mixtures contain various concentrations of eight COINs (of the nine shown in Figure[2]). 
Spectra from four technical replicates in each mixture are included to create a set of n = 36 spectra 
{xj}^]^. We designate Pi as the COIN whose concentration within each mixture defines y. Assum- 
ing a linear relationship between the spectra, {xj}, and the Pi-concentrations, {y^}, we estimate 
Pi. More precisely, we estimate the structure in Pi that correlates most with its concentrations, 
as manifest in this set of mixtures. The fLM is a simplistic model of this relationship between the 
concentration of Pi and its functional structure, but the physics of this technology imply it is a 
reasonable starting point. 

We present the results of three estimation methods: ridge, FPCRr and PEER. In constructing a 
PEER penalty, we note that the informative structure in Raman spectra is not that of low-frequency 
or other easily modeled features, but it may be obtainable experimentally. Therefore, we define L 
as in in which Q contains the span of COIN template spectra: Qi = span{Pfc}|^^. However, 
since a single set of templates may not faithfully represent signal in subsequent experiments (with 
new measurement errors, background and baseline noise etc), we enlarge Q by adding additional 
structure related to these templates. For this, set Q2 = span{P^, P^'}|^]^, where P^ denotes the 
derivative of spectrum P^. (Note, to form Q21 scale-based approximations to these derivatives are 
used since raw differencing of non-smooth spectra introduces noise.) Then set Q = span{Qi U Q2} 
and define L = a{I — Pq) + IPq. 

The regularization parameters in the PEER and ridge estimation processes were chosen using 
REML, as described in Section [6l For the FPCR/j estimate, we used the R-package refund [39] as 
implemented in |40j . 

Since /3 is not known (the model y = Xf3 + e is only approximate), we cannot report MSEs 
for these three methods. However, the structure of Pi is qualitatively known and by experimental 
design, y is directly associated with Pi. The goal here is that of extracting structure of the con- 
stituent spectral components as manifest in a linear model. This application is similar to the classic 
problem of multivariate calibration [5l [31] which essentially leads to a regression model using an 
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experimentally-designed set of spectra from laboratory mixes. 

The structure in the estimate here is expected to reflect the structure in Pi that is correlated 
with Pi's concentrations, y. The estimate is not, however, expected to precisely reconstruct Pi 
since Pi shares structure with the other COIN spectra not associated with y. See Figure [2] where Pi 
is plotted alongside the other COIN spectra. Now, Figure [3] shows plots of the PEER, FPCR/j and 
ridge estimates of the fLM coefficient function. The PEER estimate, Pq, provides an interpretable 
compromise between ridge, which involves no smoothing, and FPCR/?, which appears to oversmooth. 
For reference, the Pi spectrum is also plotted along with a mean-adjusted version of /3q, Pq + ^ 
(dashed line), where ^l{t) = (1/36) Y^i^iit), t G [400, 1800]. 




400 600 800 1000 1200 1400 1600 1800 



Figure 3: Three estimates for a coefficient function that relates concentrations of Pi to its signal in 8-COIN 
laboratory mixtures. Estimates shown: ridge (/^ridgc)' FPC!R_r {Pf) and PEER (Pq). For perspective, Pi is 

plotted (in red) and the mean-adjusted PEER estimate, Pq + (daslied blue); ^ is the mean of the mixture 
spectra {xi}^l^ (not shown). 

Finally, we consider prediction for these methods by forming a new set of spectra from different 
mixture compositions (different concentrations of each COIN) and, additionally, taken from different 
batches. This "test" set consists of spectra from four technical replicates in each of 15 mixtures 
forming a set of n = 60 spectra, {x^'^'^*}"^]^. As before. Pi is the COIN whose concentration within 
each mixture defines the values {y^^^^jiLi- For the estimates from each of the three methods (shown 
in Figure [3]) we compute the prediction error: {^/n)'^^{yl^^^ — {x\'^^^ , P))"^ . The errors for PEER, 
ridge, and FPCR/j estimates are 0.770, 0.752, 2.139, respectively. The ridge estimate here illustrates 
how low prediction error is not necessarily accompanied by interpretable structure in the estimate 
(or low MSE) [7]. 

8 Discussion 

As high-dimensional regression problems become more common, methods that exploit a priori infor- 
mation are increasingly popular. In this regard, many approaches to penalized regression are now 
founded on the idea of "structured" penalties which impose constraints based on prior knowledge 
about the problem's scientific setting. There are many ways in which such constraints may be im- 
posed, and we have focused on the algebraic aspects of a penalization process that imposes spatial 



22 



structure directly into a regularized estimation. This approach fits into the classic framework of 
L^-penalized regression but with an emphasis on the algebraic role that a penalty operator plays to 
impart structure on the estimate. 

The interplay between a structured regularization term and the coefficient-function estimate may 
not be well understood in part because it is not typically viewed in terms of the generalized singular 
vectors/ values, which is fundamental to this investigation. In particular, any penalized estimate of 
the form ([1]) with L 7^ / is intrinsically based on GSVD factors in the same way that many common 
regression methods (such as PGR, ridge, James-Stein, or partial least squares) are intrinsically 
based on SVD factors. Just as the basics of the ubiquitous SVD are important to understanding 
these methods, we have aspired to established the basics of the GSVD as it applies to a this general 
penalized regression setting and to illustrate how the theory underlying this approach can be used 
inform the choice of penalty operator. 

Toward this goal the presentation emphasizes the transparency provided by the partially-empirical 
eigenvector expansion ([7]). Properties of the estimate's variance and bias are determined explicitly 
by the generalized singular vectors whose structure is determined by the penalty operator. We have 
restricted attention to additive constraints defined by penalty operators on Lp' in order to retain 
the direct algebraic connection between the eigenproperties of the operator pair (X, L) and the 
spatial structure of f3a,L- Intuitively, the structure of the penalty's least-dominant singular vectors 
should be commensurate with the informative structure of j3. The actual effect a penalty has on 
the properties of the estimate can be quantified in terms of the GSVD vectors/values. 

This perspective differs from popular two-stage signal regression methods in which estimation is 
either preceded by fitting the predictors to a set of (external) basis functions or is followed by a step 
that smooths the estimate [8l[2Tl[30l[38lllO]. Instead, structure (smoothness or otherwise) is imposed 
directly into the estimation process. The implementation of a penalty that incorporates structure 
less generic than smoothness (or sparseness) requires some qualitative knowledge about spatial 
structure that is informative. Glearly this is not possible in all situations, but our presentation has 
focused on how a functional linear model may provide a rigorous and analytically tractable way to 
take advantage of such knowledge when it exists. 
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9 Appendix 

9.1 Defining (3 for the simulation in section 17.21 

This simulation is motivated by an interest in constructing a plausibly realistic /3 whose structure 
is naturally derived by the scientific setting involving Raman signatures of nanoparticles. Although 
one could model a (3 mathematically using, say, polynomials or Gaussian bumps (cf.. Appendix 
A. 2), such a simulation would be detached from the physical nature of this problem. Instead, we 
construct a coefficient function that genuinely comes from a functional linear model with Raman 
spectra as predictors. 

We first generate in-silico mixtures of COIN spectra as x° = Yl^=i'^i,kPki i = l,---,50, where 
Ci^k ~ unif[0, 1]. Designating Pg as the COIN of interest, we define response values that correspond 
to the "concentration" of Pg by setting y° := Scj^g, i = l,...,n. The factor of 3 imposes a strong 
association between Pg and the response. 

Now, the example in section 17.21 aims to estimate a coefficient function that truly comes from a 
solution to a linear model. However, the equation y" = X°(3 has infinitely many solutions (where 
X° is the matrix whose iih. row is x°), so we must we must regularize the problem to obtain a 
specific /?. For this, we simply use a ridge penalty and designate the resulting solution to be j3. 
This is shown by the dotted curve in Figure [2] and is qualitatively similar to Pg. 

We note that the simulation in section \T72\ uses the same set of COINs, but a new set of in-silico 
mixture spectra (i.e., a new set of {cj^fc} ~ Unif[0, 1]). In addition, a small amount of measurement 
error was added, as in section 17. H to each spectrum during the simulation. 

9.2 Frequency domain simulation 

We display results from a study that mimics the scenario of simulations studied by Hall and Horowitz 
[H]. We illustrate, in particular, properties of the MSE discussed following equation ([T9|) in sec- 
tion 15.31 relating to 6 = 0. In fact, we consider the more general scenario in which Q is not 
constructed from empirical eigenvectors (as in PCR and ridge), but is defined by a prespecified 
envelope of frequencies. 

In this simulation both (3 and Xj, i = 1, . . . , n, are generated as sums of the cosine functions 

40 

= + ^^(*)' ^(*) = 0-75'/'5(0 + l.^Mt) + iMt), t G [0, 1], 

where 7^- = j^^-'^^ , Zij is uniformly distributed on [-3^/2^3^/2] [E{Zij) = and var(Zij) = 

1), 4)1 = 1 and (l)j{t) = 2^/2 cos(j7rt) for j > 1, and ei{t) ~ N{0,a'j^), and cov(ei(t), e^/ (t')) = for 
either i 7^ or t 7^ t' . The response is defined as = (/?, Xi) + ej, where ~ N{0, a^), i.i.d.. The 
simulations involve discretizations of these curves evaluated at p = 100 equally spaced time points, 
tj, j = 1, . . . ,p, that are common to all curves. 

Penalties. We consider properties of estimates from a variety of penalties: ridge (L = /), V'^, 
T)^ + al, and PCR^. In addition, targeted penalties of the form L = I — Pq, are defined by the 

^PCR is not obtained explicitly from a penalty operator, but see Corollary 15.31 
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specified subspaces Q = span{ for defined above. Specifically, we use J = Jf = {j = 
5, 17} (a tight envelope of frequencies) to define Lp, and J = Jg = {j = 4, 20} (a less focused 
span of frequencies) to define Lq- The operator + al simply serves to illustrate the role of 
higher- frequency singular vectors as discussed in Section 14.11 In the simulations, the coefficient a 
in "D^ + al was chosen simultaneously with a via a two-dimensional grid search. 
Simulation results. Table |4] summarizes estimation results for all six penalties and two sample 
sizes, n = 50,200. The prediction results for these estimates are in Table [5l These are reported for 
S/N = 10, 5 and = 0.8, 0.6. The number of terms in the PGR estimate was optimized and ranged 
from 19 to 25 when = 0.8 and decreased with decreasing R^. Analogously, one could optimize 
over the dimension of Q (to implement a truncated GSVD), but the purpose here is illustrative 
while in practice a more robust approach would emply a penalty of the form ()14p . 



Table 4: Estimation errors (MSE) for the simulation with localized frequencies. 



n 




S/N 


Lf 


Lg 


PGR 


ridge 


V' 




50 


0.8 


10 


42.42 


77.31 


123.60 


132.50 


1051.20 


568.45 


50 


0.8 


5 


41.55 


75.41 


128.75 


143.48 


447.64 


184.07 


200 


0.8 


10 


8.28 


13.48 


33.44 


65.78 


453.54 


169.41 


200 


0.8 


5 


8.56 


13.08 


36.36 


87.59 


100.15 


76.24 


50 


0.6 


10 


106.89 


200.08 


173.56 


173.94 


1098.20 


631.05 


50 


0.6 


5 


109.51 


178.05 


178.15 


196.62 


612.12 


259.92 


200 


0.6 


10 


25.30 


38.73 


58.90 


98.13 


847.59 


350.46 


200 


0.6 


5 


22.08 


33.79 


59.92 


119.48 


240.09 


127.52 



Errors obtained with ridge and PCR are small, as expected, since the structure of /3 in this 
example is consistent with the structure represented in the singular vectors, Vk- Therefore, even 
though the relationship between the yi and Xj degrades (indeed, even as i?^ — )■ 0), these estimates 
are comprised of vectors that generally capture structure in f3 since it is strongly represented by 
the dominant eigenstructure of X. The second-derivative penalty, V^, produces the worst estimate 
in each of the scenarios due to oversmoothing. Note + al improves on P^, yet it is still not 
optimal for the range of frequencies in /?. 

Regarding Lg, the MSE gets worse as S/N increases. Indeed, here Q is fixed and relatively 
large and since the decay faster when S/N is big, this leads to rank deficiency and large variance; 
see equation (fT9]l (note, this only applies approximately since Q does not consist of ordinary SVs). 
In our previous examples, this is stabilized by a 6 > 0. 

The problems of estimation and prediction have different properties [7] ; good prediction may be 
obtained even with a poor estimate, as seen in Table El The estimate from Lf)^ is generally poor 
relative to others (as measured by the L^-norm), but its prediction error is comparable with other 
methods and is best among the non-targeted penalization methods. This is consistent with the out- 
come described by C. Goutis [Hj where (derivatives of) the predictor curves contain sharp features 
and so standard least-squares regularization (OLS, PGR, ridge, etc.) perform worse than a PEER 
estimate which imposes a greater emphasis on "regularly oscillatory but not smooth components" ; 
see section dTj 
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Table 5: Prediction errors for the simulation with localized frequencies. 



n 




S/N 


Lf 


Lg 


PGR 


ridge 




+ al 


50 


0.8 


10 


0.848 


1.134 


1.490 


1.423 


1.292 


1.246 


OU 


U.o 


r 




n Q /I n 
U.o4U 




1 /I O V 

i.4z / 


i.oyU 


i.C)U4 


l.ZZZ 


200 


0.8 


10 


0.200 


0.273 


0.432 


0.497 


0.444 


0.418 


200 


0.8 


5 


0.211 


0.276 


0.460 


0.547 


0.466 


0.455 


50 


0.6 


10 


2.165 


2.900 


3.051 


2.705 


2.621 


2.472 


50 


0.6 


5 


2.171 


2.832 


3.158 


2.938 


2.912 


2.724 


200 


0.6 


10 


0.621 


0.801 


1.058 


1.160 


1.044 


0.990 


200 


0.6 


5 


0.584 


0.766 


1.062 


1.186 


1.069 


1.014 
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